Using a Reproducible, Integrated System of R and Microsoft Power BI® to Ease the Pain of Assessing Publication Metrics
Joshua J. Cook, M.S., ACRP-PM, CCRC
Andrews Research & Education Foundation (AREF)
Biography
Disclosure - I’ve been in data science for about 1 year, and this was my first data science-related presentation.
2021 - B.S., Biomedical Science, University of West Florida
2023 - M.S., Clinical Research Management, Wake Forest University
2023 - ACRP-PM/CCRC, Association of Clinical Research Professionals (ACRP)
2024 - M.S., Data Science, University of West Florida
2025 - Entry into M.D./Ph.D. program
Publication Metrics
Publications in this talk refer to peer-reviewed literature from academic journals.
Publications are quantified in the form of publication data metrics.
These metrics can include:
Publication counts
Citation counts
Affiliation spread (via journals)
Journal impact factor (JIF), or 5-year JIF
Uses
Publication metrics signify individual, team, and organizational productivity and impact.
Used to make business decisions:
Promotions/tenure
Awards
Grant funding
Clinical study sponsorship
Problem
Research managers use this:
With something like this…
The Solution - easyPubMed
easyPubMed
An R package that interfaces with the Entrez Programming Utilities hosted by the National Center for Biotechnology Information (NCBI).
Author: Damiano Fantini, Ph.D.
Specialized version of the rentrez R package…
Two-step process: building queries using PubMed field tags, then retrieving records matching the queries from PubMed
Setup
# Create R project (Rproj), primary folder, working directory
if(!require("tidyverse")) install.packages("tidyverse")
if(!require("easyPubMed")) install.packages("easyPubMed")
if(!require("XML")) install.packages("XML")
library(tidyverse) # Data wrangling
library(easyPubMed) # Entrez interface
library(XML) # Reading and creating XML docs
1. Understanding the Query
# Author field tag (first or any order)
AnzQuery <- "Adam W Anz[AU]"
# Field tag combination with "AND" or "OR" syntax
AllAnzQuery <- "Adam W Anz[AU] OR Adam Anz[AU]"
# Combining field tags - full list in the paper
AnzJournalQuery <- "Adam W Anz[AU] AND (American Journal of Sports Medicine[TA] OR Arthroscopy[TA]) "
AnnoyingNameQuery <- "Christopher O\'Grady[AU]"
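When many aliases or journals are involved, the query strings above can be assembled programmatically instead of typed by hand. The sketch below uses only base R; `build_author_query()` is a hypothetical helper written for illustration and is not part of easyPubMed.

```r
# Hypothetical helper (not part of easyPubMed): build one PubMed query
# string from a vector of author aliases and, optionally, journal titles.
build_author_query <- function(aliases, journals = NULL) {
  # Tag each alias with the author field tag and join the aliases with OR
  author_part <- paste0(aliases, "[AU]", collapse = " OR ")
  if (is.null(journals)) {
    return(author_part)
  }
  # Tag each journal with the journal-title field tag and join with OR
  journal_part <- paste0(journals, "[TA]", collapse = " OR ")
  paste0("(", author_part, ") AND (", journal_part, ")")
}

# Author aliases only
all_anz_query <- build_author_query(c("Adam W Anz", "Adam Anz"))
print(all_anz_query)

# Author aliases restricted to two journals
anz_journal_query <- build_author_query(
  aliases  = c("Adam W Anz", "Adam Anz"),
  journals = c("American Journal of Sports Medicine", "Arthroscopy")
)
print(anz_journal_query)
```

The resulting strings can be passed to get_pubmed_ids() in the same way as the hand-written queries.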
2. Retrieving Records
# Previous query
AnzQuery <- "Adam W Anz[AU]"
# Retrieving query matches
AnzIDs <- get_pubmed_ids(AnzQuery)
# Using PMIDs to download article information (as abstract)
Anz_abstracts <- fetch_pubmed_data(pubmed_id_list = AnzIDs, format = "abstract")
print(Anz_abstracts[1:16])
[1] "1. Arthroscopy. 2023 Mar;39(3):728-729. doi: 10.1016/j.arthro.2022.11.030."
[2] ""
[3] "Editorial Commentary: Elbow Injury Results When Pediatric and Adolescent "
[4] "Throwing Athletes Throw as Hard as Possible, and Weighted Baseball Training "
[5] "Should Be Banned for Youth Athletes."
[6] ""
[7] "Anz AW(1)."
[8] ""
[9] "Author information:"
[10] "(1)Andrews Research & Education Foundation and Andrews Institute for "
[11] "Orthopaedics & Sports Medicine."
[12] ""
[13] "Comment on"
[14] " Arthroscopy. 2023 Mar;39(3):719-727."
[15] ""
[16] "We are in the middle of an epidemic involving pediatric and adolescent throwing "
2. Retrieving Records
# Using PMIDs to download article information (XML)
Anz_xml <- fetch_pubmed_data(pubmed_id_list = AnzIDs, format = "xml")
# Extracting XML-tagged data (Article Titles)
Anz_titles <- custom_grep(Anz_xml, "ArticleTitle", "char")
print(Anz_titles[1:16])
[1] "Editorial Commentary: Elbow Injury Results When Pediatric and Adolescent Throwing Athletes Throw as Hard as Possible, and Weighted Baseball Training Should Be Banned for Youth Athletes."
[2] "Blood Flow Restriction Using a Pneumatic Tourniquet Is Not Associated With a Cellular Systemic Response."
[3] "Bone Marrow Aspirate Concentrate Is Equivalent to Platelet-Rich Plasma for the Treatment of Knee Osteoarthritis at 2 Years: A Prospective Randomized Trial."
[4] "The safety and efficacy of 2 anterior-inferior portals for arthroscopic repair of anterior humeral avulsion of the glenohumeral ligament: cadaveric comparison."
[5] "Association Between Passive Hip Range of Motion and Pitching Kinematics in High School Baseball Pitchers."
[6] "Platelet-Rich Plasma: Fundamentals and Clinical Applications."
[7] "Arthroscopic Subchondral Drilling Followed by Injection of Peripheral Blood Stem Cells and Hyaluronic Acid Showed Improved Outcome Compared to Hyaluronic Acid and Physiotherapy for Massive Knee Chondral Defects: A Randomized Controlled Trial."
[8] "Biologic Association Annual Summit: 2020 Report."
[9] "Elevation of Peripheral Blood CD34+ and Platelet Levels After Exercise With Cooling and Compression."
[10] "Mobilized Peripheral Blood Stem Cells are Pluripotent and Can Be Safely Harvested and Stored for Cartilage Repair."
[11] "Blood Flow Restriction Training Using the Delfi System Is Associated With a Cellular Systemic Response."
[12] "Chondral Lesions of the Knee: An Evidence-Based Approach."
[13] "The Effects of Body Mass Index on Softball Pitchers' Hip and Shoulder Range of Motion."
[14] "Lower Extremity Pain and Pitching Kinematics and Kinetics in Collegiate Softball Pitchers."
[15] "Bone Marrow Aspirate Concentrate Is Equivalent to PRP for the Treatment of Knee OA at 1 Year: Response."
[16] "Autologous thrombin preparations: Biocompatibility and growth factor release."
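To make the idea of tag-based extraction concrete without a network call, here is a base R sketch run on a toy XML string. It is not how easyPubMed implements custom_grep(); `extract_tag()` and the toy records are hypothetical, and real PubMed XML can carry attributes and nested markup that this simple pattern does not handle.

```r
# Toy XML text standing in for a PubMed download (hypothetical records)
toy_xml <- paste0(
  "<PubmedArticleSet>",
  "<PubmedArticle><ArticleTitle>First example title.</ArticleTitle></PubmedArticle>",
  "<PubmedArticle><ArticleTitle>Second example title.</ArticleTitle></PubmedArticle>",
  "</PubmedArticleSet>"
)

# Hypothetical helper: return the text between every <tag>...</tag> pair
extract_tag <- function(xml_text, tag) {
  # Non-greedy match so each tag pair is captured separately
  pattern <- paste0("<", tag, ">(.*?)</", tag, ">")
  hits <- regmatches(xml_text, gregexpr(pattern, xml_text, perl = TRUE))[[1]]
  # Strip the opening and closing tags, keeping only the enclosed text
  gsub(paste0("</?", tag, ">"), "", hits)
}

toy_titles <- extract_tag(toy_xml, "ArticleTitle")
print(toy_titles)
```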
# Downloading ALL record information in XML format
Sorting Full Article Data
# Sorting XML files into a list of article-specific information
Anz_list <- articles_to_list(pubmed_data = Anz_download)
# Extracting article-specific information from the list
# Stored as a list of tidy dataframes
Anz_df_list <- lapply(Anz_list, article_to_df, autofill = TRUE)
# Unnesting the list into one dataframe
Anz_full_list <- do.call(rbind, Anz_df_list)
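The list-to-dataframe pattern can be tried offline on toy data. The sketch below assumes, as I understand the article_to_df() output, that each per-article data frame holds one row per author, so a publication count needs one row per PMID. The values and the reduced set of columns are hypothetical.

```r
# Toy stand-ins for the per-article data frames (hypothetical values);
# each data frame has one row per author of that article
toy_df_list <- list(
  data.frame(pmid = "111", title = "Example article A", year = "2022",
             lastname = c("Anz", "Smith"), stringsAsFactors = FALSE),
  data.frame(pmid = "222", title = "Example article B", year = "2023",
             lastname = "Anz", stringsAsFactors = FALSE)
)

# Stack the list into one author-level data frame
toy_full <- do.call(rbind, toy_df_list)

# Collapse to one row per article before counting publications
toy_articles <- toy_full[!duplicated(toy_full$pmid), c("pmid", "title", "year")]

print(nrow(toy_full))      # author-level rows
print(nrow(toy_articles))  # distinct articles
```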
Cleaning and Validation Tips
Include as many author aliases as possible inside the query
Use a differentiating variable to validate the data (i.e., is this the correct Adam Anz?)
Wrangle as needed with the tidyverse
Scale up with as many authors as needed for your organization/project
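As one way to apply the validation tip, the sketch below uses affiliation as the differentiating variable. The records are hypothetical, the column names are modeled on my understanding of the article_to_df() output, and base R is used so the example runs without extra packages; the same steps can be written with dplyr's filter() and count().

```r
# Toy author-level records (hypothetical values)
toy_records <- data.frame(
  pmid      = c("111", "222", "333"),
  lastname  = c("Anz", "Anz", "Anz"),
  firstname = c("Adam W", "Adam", "Adam"),
  address   = c("Andrews Research & Education Foundation",
                "Andrews Institute for Orthopaedics & Sports Medicine",
                "Some Other University"),
  year      = c("2022", "2023", "2023"),
  stringsAsFactors = FALSE
)

# Differentiating variable: keep rows whose affiliation mentions "Andrews"
is_valid  <- grepl("Andrews", toy_records$address, ignore.case = TRUE)
validated <- toy_records[is_valid, ]

# Rows that fail the check are set aside for manual review, not silently dropped
flagged <- toy_records[!is_valid, ]

# Publication counts per year for the validated records
pubs_per_year <- table(validated$year)
print(pubs_per_year)
```

Keeping the flagged rows makes it possible to catch valid publications whose affiliation text is missing or formatted differently.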
Reporting Options: Quarto® and Microsoft Power BI®
A Basic Quarto® Report
1. Quarto®
An open-source scientific and technical publishing system built into RStudio (i.e., the next generation of R Markdown)
6 Quick Steps:
In RStudio®, create a new Quarto® Document
Edit the YAML header to fit the needs of the report
Create an R code block for setup and data wrangling. The output of this block should not be included in the report
Create additional R code blocks to display tables and figures that tell the story of the data (using packages such as gt and ggplot2).
Add branding and context using external text, images, and links
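The steps above can be sketched as a minimal Quarto skeleton, written here from R so the example stays in one language. The title, file name, and chunk contents are placeholders and not the author's actual report. Rendering is left commented out because it assumes both Quarto and the quarto R package are installed.

```r
# Code-fence marker for the .qmd chunks, built with strrep() for readability
fence <- strrep("`", 3)

# Minimal Quarto document: YAML header, hidden setup chunk, one display chunk
qmd_lines <- c(
  "---",
  "title: \"Publication Metrics Report\"",
  "format: html",
  "---",
  "",
  paste0(fence, "{r}"),
  "#| label: setup",
  "#| include: false",
  "# Setup and data wrangling; include: false keeps this chunk out of the report",
  "library(tidyverse)",
  fence,
  "",
  "## Publications per year",
  "",
  paste0(fence, "{r}"),
  "#| label: pubs-per-year",
  "#| echo: false",
  "# A table or figure goes here (for example with gt or ggplot2)",
  fence
)

# Write the skeleton to a temporary folder
qmd_path <- file.path(tempdir(), "publication_report.qmd")
writeLines(qmd_lines, qmd_path)

# Assumes Quarto and the quarto R package are installed:
# quarto::quarto_render(qmd_path)
```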
2. Microsoft Power BI®
A proprietary business intelligence (BI) application developed by Microsoft. Three basic options for connecting our data to the Power BI desktop client:
Export the dataframe from R as a static data source (e.g., xlsx, csv)
Save the RData file within RStudio and load it from a defined working directory within the Power BI desktop client
Run the entire R script within the Power BI desktop client (no dedicated IDE needed!)
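The first two connection options can be sketched in base R as follows. The data frame, file names, and the use of a temporary folder are placeholders; in practice the files would go to a project folder that the Power BI desktop client can reach.

```r
# Toy publication data standing in for the cleaned dataframe (hypothetical values)
toy_pubs <- data.frame(
  pmid  = c("111", "222"),
  title = c("Example article A", "Example article B"),
  year  = c(2022, 2023),
  stringsAsFactors = FALSE
)

export_dir <- tempdir()  # placeholder for a real project folder

# Option 1: static export, loaded in Power BI as a text/CSV source
csv_path <- file.path(export_dir, "publications.csv")
write.csv(toy_pubs, csv_path, row.names = FALSE)

# Option 2: save the R object, then load() it from an R script inside Power BI
rdata_path <- file.path(export_dir, "publications.RData")
save(toy_pubs, file = rdata_path)

# Quick round-trip check of the CSV export
csv_check <- read.csv(csv_path, stringsAsFactors = FALSE)
print(nrow(csv_check))
```

For the third option, the whole retrieval script is pasted into the R script data connector in Power BI, which imports the data frames the script creates.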
Simple Microsoft Power BI® Report Steps
Transform the data using standard Microsoft Power Query® syntax
From the visualizations tab, edit report page settings to define the canvas size, background, and any other customizations
Choose from available visualizations within Microsoft Power BI® or external visualizations from the visualization store, and other scripting options (such as R or Python)
Once a visualization is chosen, drag fields (data columns) of interest from the data tab to the visualization or filter tabs.
Continue adding visualizations to tell the story of the data
Add branding and context using external text, images, and links
Publish the final report to the Microsoft Power BI Online Service® to distribute to stakeholders
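For the R scripting option mentioned in the steps above, an R visual in Power BI receives the fields dragged onto it as a data frame named `dataset`. The sketch below is one possible script for such a visual, assuming a `year` field has been added; the toy data is hypothetical and only stands in for `dataset` so the code also runs outside Power BI.

```r
# Inside a Power BI R visual, `dataset` is supplied by Power BI.
# Outside Power BI, fall back to toy data so the sketch still runs.
if (!exists("dataset", inherits = FALSE)) {
  dataset <- data.frame(year = c(2021, 2022, 2022, 2023, 2023, 2023))
}

# Count publications per year
pubs_per_year <- table(dataset$year)

# Draw a simple bar chart with base R graphics
barplot(
  pubs_per_year,
  xlab = "Publication year",
  ylab = "Publications",
  main = "Publications per Year"
)
```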
Conclusions
Publication metrics are increasingly being used to measure individual- and organization-level productivity and impact within academia and industry
Historically, publication metrics have not been the easiest thing to quantify and manage
Instead of manually obtaining this data, it is much more feasible to leverage the R programming language and various reporting systems to manage this data
In the future, this system should also capture article citation counts and journal impact factors to add more context to these metrics. Automation of this system (i.e., automated, timed data refreshes) using a third-party program should be investigated as well
Contact Information
Joshua J. Cook, M.S., ACRP-PM, CCRC
Cell: (850)736-1801
Email: jcook0312@outlook.com (Email me for the full paper!)